SET UP

IMPORT PACKAGES AND CONFIGURE SETTINGS

LOAD AND INSPECT DATA

Dataset is from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/heart+disease

Attribute definition:

TRANSFORM

The target variable, heart_diease, currently has 5 unique values. Values from 1 to 4 indicate the presence of heart disease, while a value of 0 indicates its absence. Let's make this a binary classification problem, where 1 means presence and 0 means absence.
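A minimal sketch of this recoding, assuming the data sits in a pandas DataFrame. The toy frame below is made up for illustration; only the column name comes from the text above.

```python
import pandas as pd

# Toy frame standing in for the loaded data (values are illustrative only).
df = pd.DataFrame({"heart_diease": [0, 1, 2, 3, 4, 0]})

# Any value from 1 to 4 becomes 1 (presence); 0 stays 0 (absence).
df["heart_diease"] = (df["heart_diease"] > 0).astype(int)

print(df["heart_diease"].value_counts().sort_index())
```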

In the section above, the variables ca and thal contain records with the value "?". Let's drop these records, since they make up only 2% of the entire dataset.
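One way to drop those records is sketched below. It assumes the "?" placeholders arrive as strings in otherwise numeric columns; the toy rows are invented for illustration.

```python
import pandas as pd

# Toy frame: ca and thal are read as strings because of the "?" placeholder.
df = pd.DataFrame({
    "age": [63, 67, 41, 56],
    "ca": ["0.0", "?", "1.0", "2.0"],
    "thal": ["6.0", "3.0", "?", "7.0"],
})

# Flag every row where either column holds the placeholder.
has_placeholder = (df[["ca", "thal"]] == "?").any(axis=1)

# Keep the complete rows and convert the two columns to numbers.
cleaned = df.loc[~has_placeholder].copy()
cleaned = cleaned.astype({"ca": float, "thal": float})

# Share of rows removed, worth checking before dropping anything.
share_dropped = has_placeholder.mean()
print(f"dropped {share_dropped:.0%} of rows")
```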

EDA

DISTRIBUTION

From the plots above, we can see that the probability of having heart disease rises with increasing age, blood pressure, and cholesterol. According to the CDC, a normal blood pressure is less than 120/80 mmHg and a healthy serum cholesterol is less than 200 mg/dL. We see more individuals with heart disease among those who are above these thresholds.
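The observation can also be checked numerically. The sketch below assumes the columns are named trestbps (resting systolic pressure), chol, and target; those names and the toy values are assumptions, not taken from the notebook.

```python
import pandas as pd

# Toy data; column names are assumed and values are illustrative only.
df = pd.DataFrame({
    "trestbps": [110, 130, 145, 118, 160, 125],
    "chol": [180, 240, 260, 190, 300, 210],
    "target": [0, 1, 1, 0, 1, 0],
})

# Thresholds quoted in the text: 120 mmHg systolic and 200 mg/dL.
above_both = (df["trestbps"] >= 120) & (df["chol"] >= 200)
df["risk_group"] = above_both.map({True: "above", False: "at_or_below"})

# Share of patients with heart disease in each group.
rates = df.groupby("risk_group")["target"].mean()
print(rates)
```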

CORRELATION

While the predictive performance of tree-based boosting algorithms is largely unaffected by multicollinearity, it is good practice to check whether it exists and to remove redundant features.
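A minimal sketch of such a check, using synthetic columns. The 0.9 cutoff is an assumed value, not one taken from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic features: b is almost a rescaled copy of a, c is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

# Absolute pairwise correlations, keeping only the upper triangle
# so that each pair is inspected once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cutoff for "redundant"
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```

Of each highly correlated pair, this drops the later column. Which member to keep is a judgment call that depends on the features involved.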

TRAIN AND TEST SET

Create two different train and test sets. One set encodes the categorical variables while the other does not; this is to demonstrate CatBoost's built-in handling of categorical variables.
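The two sets could be built as sketched below, assuming one-hot encoding for the encoded version and scikit-learn's train_test_split for the split. The column names, the 70/30 split, and the seed are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with one categorical column (cp) kept as strings.
df = pd.DataFrame({
    "age": [63, 67, 41, 56, 62, 57, 44, 52, 48, 54],
    "cp": ["1", "4", "2", "3", "4", "4", "2", "3", "3", "4"],
    "target": [0, 1, 0, 0, 1, 1, 0, 1, 0, 1],
})
cat_cols = ["cp"]

X_raw = df.drop(columns="target")                    # categories left as-is
X_encoded = pd.get_dummies(X_raw, columns=cat_cols)  # one-hot encoded copy
y = df["target"]

# The same seed and stratification give both versions identical rows.
split_args = dict(test_size=0.3, random_state=42, stratify=y)
X_raw_train, X_raw_test, y_train, y_test = train_test_split(X_raw, y, **split_args)
X_enc_train, X_enc_test, _, _ = train_test_split(X_encoded, y, **split_args)
```

The unencoded set is the one that would go to CatBoost, with the categorical columns named through its cat_features parameter.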

MODEL

The purpose of this project is to compare and evaluate the boosting algorithms XGBoost, LightGBM, and CatBoost. Peers on Kaggle use a variety of algorithms and achieve a model accuracy in the low to mid-80% range. Let's see if this can be achieved. In addition to accuracy, precision, recall, and F1 score are examined.
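As a reminder of what the four metrics measure, here is a small plain-Python sketch that computes them from binary labels. The helper name and the example labels are hypothetical; in practice sklearn.metrics provides equivalent functions.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative labels only: 3 TP, 3 TN, 1 FP, 1 FN.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In a screening setting such as this one, recall deserves particular attention, since a false negative corresponds to a missed case of heart disease.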

XGBOOST

LIGHTGBM

CATBOOST

CatBoost outperformed XGBoost and LightGBM in terms of AUC, accuracy, recall, precision, and F1 score.

MODEL FAIRNESS

Let's check CatBoost's fairness, to ensure the model did not inherit any bias

The selection rate for males is higher than for females. This should be okay for two reasons. First, there are more male than female patients in the dataset. Second, according to Johns Hopkins and other similar sources, men are more susceptible to heart disease than women.
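The selection rate is the share of each group that the model predicts as positive. A minimal sketch of that computation follows, with the observed base rate shown alongside for comparison. The group labels and predictions are invented for illustration and are not the notebook's results.

```python
import pandas as pd

# Illustrative labels and predictions only.
results = pd.DataFrame({
    "sex":    ["male"] * 5 + ["female"] * 3,
    "y_true": [1, 1, 1, 0, 0, 0, 1, 0],
    "y_pred": [1, 1, 0, 1, 0, 0, 1, 0],
})

summary = results.groupby("sex").agg(
    selection_rate=("y_pred", "mean"),  # share predicted positive
    base_rate=("y_true", "mean"),       # share actually positive
    n=("y_pred", "size"),               # group size
)
gap = (summary.loc["male", "selection_rate"]
       - summary.loc["female", "selection_rate"])
print(summary)
print(f"selection rate gap: {gap:.3f}")
```

Comparing the selection rate with the base rate of each group helps separate a difference the data itself contains from one the model introduces.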